Adventures in Data Science
Overview
This tutorial covers the basics of using the Command Line and Git to track and record changes to files on your local computer. It provides background information that will help you to better understand the concepts that we will discuss in class and to better participate in the hands-on portion of the course.
Working with the Command Line
Most users interact with their computer through a Graphical User Interface (GUI) that allows them to use a mouse, keyboard, and graphical elements on screen (such as file menus, pictures of folders and files, etc.) to perform their work. Users tend to conflate their Operating System and their GUI because computer hardware and software manufacturers tightly couple these two concerns as a convenience to users. But the Windows 10 or macOS Big Sur operating system that makes your computer work and the Windows 10 or macOS Big Sur GUI that you interact with are, in fact, different and separable pieces of software. It is possible to interact with your computer through methods and software other than the stock, tightly coupled GUI that launches automatically when you turn on your computer.
Because operating system makers like Microsoft and Apple devote so many resources to the development of their system GUIs, there are few viable (at present, none, commercially available) competing GUIs for these platforms. This is not the case in the Linux world, however, where users have several system GUI packages from which to choose and can seamlessly switch between them as desired. Despite the lack of competition/choice on the GUI front when it comes to interacting with your computer, there are other, non-graphical ways of communicating directly with your operating system that exist for all operating systems. We call these “Command Line” interfaces. The Command Line offers a text-only, non-graphical means of interacting with your computer. In the early days of computing, all user interaction with the computer happened at the command line. In the current days of graphical user interfaces, using the Command Line requires you to launch a special program that provides Command Line access.
Mac users will use an application called “Terminal” which ships by default with the Mac operating system. To launch the Terminal application, go to:
Applications -> Utilities -> Terminal
When you launch the application, you will see something like this:
Windows users will use an application called Git Bash, which was installed on your system when you installed Git. To launch Git Bash, go to:
Click on the Windows Start Menu and search for “Git Bash”
Alternatively,
Click on the Windows Start Menu, select Programs, and browse to Git Bash
When you launch the application, you will see something like this:
Interacting with the Command Line
While it can look intimidating to those raised on the GUI, working with the Command Line is actually quite simple. Instead of pointing and clicking on things to make them happen, you type written commands.
The figure below shows a new, empty Command Line Interface in the Mac Terminal application.
The Command Line prompt contains a lot of valuable information. The beginning of the line, “(base) MacPro-F5KWP01GF694” tells us exactly which computer we are communicating with. This may seem redundant, but it is actually possible to interact with computers other than the one you are typing on by connecting to them via the Command Line over the network.
The bit of information after the colon, in this example the “~” character, tells us where in the computer’s filesystem we are. We’ll learn more about this later; for now you need to understand that the “~” character means that you are in your home directory.
The next piece of information we are given is the username under which we are logged into the computer, in this case, my local username, “cstahmer”.
After the username, we see the “$” character. This is known as the Command Prompt. It is an indicator that the Command Line application is waiting for you to enter something. The Command Prompt character is used throughout these materials when giving command examples. When working through materials, DO NOT ENTER the Command Prompt. It will already be there telling you that the computer is ready to receive your command.
Depending on your system and/or Command Line interface, you may or may not also see a solid or flashing box that appears after the Command Prompt. This is a Cursor Position Indicator, which tells you where the current cursor is in the terminal. This is useful if you need to go back and correct an error. Generally speaking, you can’t click a mouse in a terminal app to edit text. You need to use your computer’s right and left arrows to move the cursor to the correct location and then make your edit.
As noted earlier, we interact with the Command Line by typing commands. The figure below shows an example of a simple command, “echo” being entered into the Command Line.
The “echo” command prints back to screen any text that you supply to the command. It literally echoes your text. To execute this or any command, you simply hit the “return” or “enter” key on your keyboard. You’ll see that when you execute a Command Line command the system performs the indicated operation, prints any output from the operation to screen, and then delivers a new Command Line prompt.
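As a minimal sketch of the behavior described above, the lines below run echo twice. The text being echoed is an arbitrary example, and the commands are written without the prompt character so that they can be typed exactly as shown.

```shell
# echo prints its arguments back to the screen, followed by a newline.
echo Hello from the Command Line

# Quoting keeps the text together as one argument and preserves its spacing.
greeting=$(echo "Call me   Ishmael")
echo "$greeting"
```

The second example captures the output of echo in a variable (here called greeting, a name chosen only for illustration) to show that what echo prints is exactly the text it was given.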
Note that depending on your particular system and/or Command Line interface, things might look slightly different on your computer. However, the basic presentation and function as described above will be the same.
Common Command Line Commands
During our hands-on, in-class session we will practice using the following Command Line commands. Be prepared to have this page ready as a reference during class to make things easier.
| Command | Name | Function |
|---|---|---|
| ls | List | Lists all files in the current directory. |
| ls -l | List with Long flag | Lists additional information about each file. |
| ls -a | List with All flag | Lists all files, including hidden files. |
| pwd | Print Working Directory | Prints the current working directory. |
| mkdir | Make Directory | Creates a new file directory. |
| cd | Change Directory | Navigates to another directory on the file system. |
| mv | Move | Moves files. |
| cp | Copy | Copies files. |
| rm | Remove/delete | Deletes files. |
For a more complete list of Unix Commands, see the Unix Cheat Sheet.
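The short session below strings the commands from the table together. It is only a sketch: the directory and file names (scratch, notes.txt, and so on) are made up for this example, and the mktemp and touch commands, which do not appear in the table, are used here only to set up a safe place to experiment.

```shell
# Work inside a throwaway directory so nothing on your system is touched.
workdir=$(mktemp -d)
cd "$workdir"

pwd                              # print the current working directory
mkdir scratch                    # create a new directory
cd scratch                       # move into it

echo "first draft" > notes.txt   # create a small file to work with
cp notes.txt backup.txt          # copy it
mv notes.txt draft.txt           # rename (move) the original
touch .hidden                    # names that start with a dot are hidden

ls                               # lists backup.txt and draft.txt
ls -a                            # also shows ., .., and .hidden
ls -l                            # long format: permissions, size, date

rm backup.txt                    # delete a file (this cannot be undone)
listing=$(ls)
echo "$listing"
```

Note that rm deletes immediately; there is no trash folder to recover from, which is one reason to practice in a temporary directory first.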
Command Line Text Editors
The Command Line also features a variety of different text editors, similar in nature to Microsoft Word or Mac Pages but much more stripped down. These editors are only accessible from the Command Line; we won’t spend very much time with them, but it is important to know how to use them so that you can open, read, and write directly in the Command Line window.
Macs and Git Bash both ship with a text editor called Vim (other common editors include Emacs and Nano). To open a file with vim, type vi in a Command Line window, followed by the filename. If you want to create a new file, simply type the filename you’d like to use for that file after vi.
Vim works a bit differently than other text editors and word processors. It has a number of ‘modes,’ which provide different forms of interaction with a file’s data. We will focus on two modes, Normal mode and Insert. When you open a file with Vim, the program starts in Normal mode. This mode is command-based and, somewhat strangely, it doesn’t let you insert text directly in the document (the reasons for this have to do with Vim’s underlying design philosophy: we edit text more than we write it on the Command Line).
To insert text in your document, switch to Insert mode by pressing i. You can check whether you’re in Insert mode by looking at the bottom left hand portion of the window, which should read -- INSERT --.
Once you are done inserting text, pressing ESC (the Escape key) will bring you back to Normal mode. From here, you can save and quit your file, though these actions differ from other text editors and word processors: saving and quitting with Vim works through a sequence of key commands (or chords), which you enter from Normal mode.
To save a file in Vim, make sure you are in Normal mode and then enter :w. Note the colon, which must be included. After you’ve entered this key sequence, in the bottom left hand corner of your window you should see “[filename] XL, XC written” (L stands for “lines” and C stands for “characters”).
To quit Vim, enter :q. This should take you back to your Command Line and, if you have created a new file, you will now see that file in your window.
If you don’t want to save the changes you’ve made in a file, you can toss them out by typing :q! in place of :w and then :q. Also, in Vim key sequences for save, quit, and hundreds of other commands can be chained together. For example, instead of separately inputting :w and :q to save and quit a file, you can use :wq, which will produce the same effect. There are dozens of base commands like this in Vim, and the program can be customized far beyond what we need for our class. More information about this text editor can be found here.
Basic Vim Commands
| Command | Function |
|---|---|
| esc | Enter Normal mode. |
| i | Enter Insert mode. |
| :w | Save. |
| :q | Quit. |
| :q! | Quit without saving. |
For a more complete list of Vim commands, see this Cheat Sheet.
Introduction to Version Control
This section covers the basics of using Version Control Software (VCS) to track and record changes to files on your local computer. It provides background information that will help you to better understand what VCS is, why we use it, and how it does its work.
What is Version Control?
Version control describes a process of storing and organizing multiple versions (or copies) of documents that you create. Approaches to version control range from simple to complex and can involve the use of various human workflows and/or software applications to accomplish the overall goal of storing and managing multiple versions of the same document(s).
Most people have a folder/directory somewhere on their computer that looks something like this:
Or perhaps, this:
This is a rudimentary form of version control that relies completely on the human workflow of saving multiple versions of a file. This system works minimally well, in that it does provide you with a history of file versions theoretically organized by their time sequence. But this filesystem method provides no information about how the file has changed from version to version, why you might have saved a particular version, or specifically how the various versions are related. This human-managed filesystem approach is more subject to error than software-assisted version control systems. It is not uncommon for users to make mistakes when naming file versions, or to go back and edit files out of sequence. Software-assisted version control systems (VCS) such as Git were designed to solve this problem.
Software Assisted Version Control
Version control software has its roots in the software development community, where it is common for many coders to work on the same file, sometimes synchronously, amplifying the need to track and understand revisions. But nearly all types of computer files, not just code, can be tracked using modern version control systems. IBM’s OS/360 IEBUPDTE software update tool is widely regarded as the earliest and most widely adopted precursor to modern version control systems. The release in 1972 of the Source Code Control System (SCCS), developed at Bell Labs, marked the first fully fledged system designed specifically for software version control.
Today’s marketplace offers many options when it comes to choosing a version control software system. They include systems such as Git, Visual Source Safe, Subversion, Mercurial, CVS, and Plastic SCM, to name a few. Each of these systems offers its own twist on version control, differing sometimes in the area of user functionality, sometimes in how it handles things on the back-end, and sometimes both. This tutorial focuses on the Git VCS, but in the sections that follow we offer some general information about classes of version control systems to help you better understand how Git does what it does and help you make more informed decisions about how to deploy it for your own work.
Local vs Server Based Version Control
There are two general types of version control systems: Local and Server (sometimes called Cloud) based systems. When working with a Local version control system, all files, metadata, and everything associated with the version control system live on your local drive in a universe unto itself. Working locally is a perfectly reasonable option for those who work independently (not as part of a team), have no need to regularly share their files or file versions, and who have robust back-up practices for their local storage drive(s). Working locally is also sometimes the only option for projects involving protected data and/or proprietary code that cannot be shared.
Server based VCS utilize software running on your local computer that communicates with a remote server (or servers) that store your files and data. Depending on the system being deployed, files and data may reside exclusively on the server and are downloaded to temporary local storage only when a file is being actively edited. Or, the system may maintain continuous local and remote versions of your files. Server based systems facilitate team science because they allow multiple users to have access to the same files, and all their respective versions, via the server. They can also provide an important, non-local back-up of your files, protecting you from loss of data should your local storage fail.
Git is a free Server based version control system that can store files both locally and on a remote server. While the sections that follow offer a broader description of Server based version control, in this workshop we will focus only on using Git locally and will not configure the software to communicate with, store files on, or otherwise interact with a remote server. DataLab’s companion “Git for Teams” workshop focuses on using Git with the GitHub cloud service to capitalize on Git’s distributed version control capabilities.
Server based version control systems can generally be segmented into two distinct categories: 1) Centralized Version Control Systems (Centralized VCS) and 2) Distributed Version Control Systems (Distributed VCS).
Centralized Version Control Systems
Centralized VCS is the oldest and, surprisingly to many, still a widely used form of version control architecture. Centralized VCS implement a “spoke and wheel” architecture to provide server based version control.
With the spoke and wheel architecture, the server maintains a centralized collection of file versions. Users utilize version control clients to “check-out” a file of interest to their local file storage, where they are free to make changes to the file. Centralized VCS typically restrict other users from checking out editable versions of a file if another user currently has the file checked out. Once the user who has checked out the file has finished making changes, they “check-in” their new version, which is then stored on the server from where it can be retrieved and “checked-out” by another user. As can be seen, Centralized VCS provide a very controlled and ordered universe that ensures file integrity and tracking of changes. However, this regulation comes at a cost. Namely, it reduces the ease with which multiple users can work simultaneously on the same file.
Distributed Version Control Systems
Distributed VCS are not dependent on a central repository as a means of sharing files or tracking versions. Distributed VCS implement a network architecture (as opposed to the spoke and wheel of the Centralized VCS as pictured above) to allow each user to communicate directly with every other user.
In Distributed VCS, each user maintains their own version history of the files being tracked, and the VCS software communicates between users to keep the various local file systems in sync with each other. With this type of system, the local versions of two different users will diverge from each other if both users make changes to the file. This divergence will remain in place until the local repositories are synced, at which time the VCS stitches (or merges) the two different versions of the file into a single version that reflects the changes made by each individual, and then saves the stitched version of the file onto both systems as the current version. Various mechanisms can then be used to resolve the conflicts that may arise during this merge process. Distributed VCS offer greater flexibility and facilitate collaborative work, but a lack of understanding of the sync/merge workflow can cause problems. It is not uncommon for a user to forget to sync their local repository with the repositories of other team members and, as a result, work for extended periods of time on outdated files that don’t reflect their teammates’ changes, which leads to work inefficiencies and merge challenges.
The Best of Both Worlds
An important feature of Distributed VCS is that many users and organizations choose to include a central server as a node in the distributed network. This creates a hybrid universe in which some users will sync directly to each other while other users will sync through a central server.
Syncing with a cloud-based server provides an extra level of backup for your files and also facilitates communication between users. But treating the server as just another node on the network (as opposed to a centralized point of control) puts the control and flexibility back in the hands of the individual developer. For example, in a true Centralized VCS, if the server goes down then nobody can check files in and out of the server, which means that nobody can work. But in a Distributed VCS this is not an issue. Users can continue to work on local versions and the system will sync any changes when the server becomes available. Git, which is the focus of this tutorial, is a Distributed VCS. You can use Git to share and sync repositories directly with other users or through a central Git server such as, for example, GitHub or GitLab.
VCS and the Computer File System
When we think about Version Control, we typically think about managing changes to individual files. From the user perspective, the File is typically the minimum accessible unit of information. Whether working with images, tabular data, or written text, we typically use software to open a File that contains the information we want to view or edit. As such, it comes as a surprise to most users that the concept of Files, and their organizing containers (Folders or Directories), are not intrinsic to how computers themselves store and interact with data. In this section of the tutorial we will learn about how computers store and access information and how VCS interact with this process to track and manage files.
How Computers Store and Access Information
For all of their computing power and seeming intelligence, computers still only know two things: 0 and 1. In computer speak, we call this a binary system, and the unit of memory on a hard-disk, flash drive, or computer chip that stores each 1 or 0 is called a bit. You can think of your computer’s storage device (regardless of what kind it is) as presenting a large grid, where each box is a bit:
In the above example, as with most computer storage, the bits in our storage grid are addressable, meaning that we can designate a particular bit using a row and column number such as, for example, A7, or E12. Also, remember, that each bit can only contain one of two values: 0 or 1. So, in practice, our storage grid would actually look something like this:
All of the complex information that we store in the computer is translated to this binary language prior to storage using a character encoding, most commonly one based on Unicode. You can think of an encoding as a codebook that assigns a unique combination of ones and zeros to each letter, numeral, or symbol. In the widely used UTF-8 encoding of Unicode, each character takes between one and four 8-bit bytes. For example, the 8-bit code for the upper case letter “A” is “01000001”, and the 8-bit code for the digit “3” is “00110011”. The above grid actually spells out the phrase, “Call me Ishmael”, the opening line of Herman Melville’s novel Moby Dick.
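The 8-bit patterns quoted above can be reproduced at the Command Line. The sketch below defines a small helper, to_bits, which is a hypothetical function written only for this example; it assumes a single plain ASCII character as input and converts that character's numeric code to eight binary digits.

```shell
# to_bits CHAR: print the 8-bit binary pattern for one ASCII character.
to_bits() {
    # A leading quote makes printf report the character's numeric code.
    n=$(printf '%d' "'$1")
    bits=""
    i=0
    while [ "$i" -lt 8 ]; do
        bits="$((n % 2))$bits"   # prepend the lowest remaining bit
        n=$((n / 2))
        i=$((i + 1))
    done
    printf '%s\n' "$bits"
}

bits_A=$(to_bits A)
bits_3=$(to_bits 3)
echo "A -> $bits_A"   # A -> 01000001
echo "3 -> $bits_3"   # 3 -> 00110011
```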
An important aspect of how computers store information in binary form is that, unlike most human readable forms of data storage, there is no right to left, up or down, or any other regularized organization of bits on a storage medium. When you save a file on your computer, the computer simply looks for any open bits and starts recording information. The net result is that the contents of a single file are frequently randomly interleaved with data from other files. This mode of storage is used because it maximizes the use of open bits on the storage device. But it presents the singular problem of not making data readable in a regularized, linear fashion. To solve this problem, all computers reserve a particular part of their internal memory for a “Directory” which stores a sector map of all chunks of data. For example, if you create a file called README.txt with the word “hello” in it, the computer would randomly store the Unicode for the five characters in the word “hello” on the storage device and make a directory entry something like the following:
Understanding the Directory concept and how computers store information is crucial to understanding how VCS manage your Files.
How VCS Manage Your Files
Most users think about version control as a process of managing files. For example, I might have a directory called “My Project” that holds several files related to this project as follows:
One approach to managing changes to the above project files would be to store multiple versions of each file as in the figure below for the file analysis.r:
In fact, many VCS do exactly this. They treat each file as the minimum unit of data and simply save various versions of each file along with some additional information about the version. This approach can work reasonably well. However, it has limitations. First, this approach can unnecessarily consume space on the local storage device, especially if you are saving many versions of a very large file. It also has difficulty dealing with changes in filenames, typically treating the same file with a new name as a completely new file, thereby breaking the chain of version history.
To combat these issues, good VCS don’t actually manage files at all. They manage Directories. Distributed VCS like Git take this alternate approach to data storage that is Directory, rather than file, based.
Graph-Based Data Management
Git (and many other Distributed VCS) manage your files as collections of data rather than collections of files. Git’s primary unit of management is the “Repository,” or “Repo” for short, which is aligned with your computer’s Directory/Folder structure. Consider, for example, the following file structure:
Here we see a user, Tom’s, home directory, which contains three sub directories (Data, Thesis, and Tools) and one file (Notes.txt). Both the Data and Tools directories contain sub files and/or directories. If Tom wanted to track changes to the two files in the Data directory, he would first create a Git repository by placing the Data directory “under version control.”
When a repository is created, the Git system writes a collection of hidden files into the Data Directory that it uses to store information about all of the data that lives under that directory. This includes information about the addition, renaming, and deletion of both files and folders as well as information about changes to the data contained in the files themselves. Additions, deletions and versions of files are tracked and stored not as copies of files, but rather as a set of instructions that describes changes made to the underlying data and the directory structure that describes them.
Additional Resources
The Git Book is the definitive Git resource and provides an excellent reference for everything that we will cover in the Interactive session. There is no need to read the book prior to the session, but it’s a good reference resource to have available as you begin to work with Git after the workshop.
Introduction to Git
Save, Stage, Commit
Git does not automatically preserve versions of every “saved” file. When working with Git, you save files as you always do, but this has no impact on the versions that are preserved in the repository. To create a version, you must first add saved files to a Staging area and then “Commit” your staged files to the repository. The Commits that you make constitute the versions of files that are preserved in the repository.
Creating Your First Repo
Move to your Home directory
$ cd ~
note: The $ character represents your command prompt. DO NOT type it into your terminal.
Create a new directory for this workshop
$ mkdir introtogit
Change to the new directory
$ cd introtogit
Put the new directory under version control
$ git init
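To see what placing a directory under version control actually does, the sketch below repeats the steps above and then lists hidden files. It assumes Git is installed, and it works in a temporary directory rather than your Home directory so that it can be run safely more than once.

```shell
# Use a throwaway location instead of the Home directory.
workdir=$(mktemp -d)
cd "$workdir"
mkdir introtogit
cd introtogit

git init -q              # -q (quiet) suppresses the informational message

ls -a                    # the hidden .git directory now appears in the listing

# The presence of .git is what makes this directory a repository.
if [ -d .git ]; then
    repo_state="under version control"
else
    repo_state="not a repository"
fi
echo "introtogit is $repo_state"
```

Everything Git records about the repository lives inside that hidden .git directory; deleting it removes the version history but leaves your working files in place.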
Checking the Status of a Repo
To check the status of a repository, use the following command:
$ git status
Version of a File
In Gitspeak, we ‘commit’ a version of a file to the repository to save a copy of the current working version of the file. This is a multi-step process in which we first ‘stage’ the file to be committed and then ‘commit’ the file.
STEP 1: Place the file you want to version into the Staging Area
$ git add <filename>
Replace <filename> with the name of the file that you want to stage.
STEP 2: Commit Staged Files
$ git commit -m 'A detailed comment explaining the nature of the version being committed. Do not include any apostrophes in your comment.'
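The save, stage, and commit steps can be followed end to end in a scratch repository. The sketch below assumes Git is installed; the file name README.txt, the commit message, and the author name and email are placeholder values chosen for illustration.

```shell
# Everything happens in a throwaway repository.
workdir=$(mktemp -d)
cd "$workdir"
git init -q
# Identify the author for this repository only (placeholder values).
git config user.name "Example User"
git config user.email "user@example.com"

echo "hello" > README.txt        # save: the file exists but is untracked
git status --short               # prints: ?? README.txt

git add README.txt               # stage: the file is queued for the next commit
git status --short               # prints: A  README.txt

git commit -q -m "Add README with a greeting"   # commit: the version is recorded
commit_count=$(git rev-list --count HEAD)
echo "commits so far: $commit_count"
```

The --short flag asks git status for a compact, two-column summary; running plain git status at each step gives the same information in a longer, more descriptive form.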
View a History of Your Commits
To get a history of commits
$ git log
To see commit history with patch data (insertions and deletions) for a specified number of commits
$ git log -p -2
To see abbreviated stats for the commit history
$ git log --stat
You can save a copy of your Git log to a text file with the following command:
$ git --no-pager log > log.txt
Comparing Commits
$ git diff <commit> <commit>
Comparing Files
$ git diff <commit> <file>
or
$ git diff <commit>:<file> <commit>:<file>
To View an Earlier Commit
$ git checkout <commit>
To solve the Detached HEAD problem, either reset HEAD as described below or just check out another branch
$ git checkout <branch>
To save this older version as a parallel branch, execute
$ git checkout -b <new_branch_name>
This will save the older commit as a new branch running parallel to master.
Undoing Things
One of the common undos takes place when you commit too early and possibly forget to add some files, or you mess up your commit message. If you want to redo that commit, make the additional changes you forgot, stage them, and commit again using the --amend option
$ git commit --amend
To unstage a file for commit use
$ git reset HEAD <file>
Throwing away changes you’ve made to a file
$ git checkout -- <file>
Rolling everything back to the last commit
$ git reset --hard HEAD
Rolling everything back to the next to last commit (The commit before the HEAD commit)
$ git reset --hard HEAD^
Rolling everything back to two commits before the HEAD
$ git reset --hard HEAD~2
Rolling everything back to an identified commit using HASH/ID from log
$ git reset --hard <commit>
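The sketch below builds a three-commit history in a scratch repository and then walks back through it using the tilde notation, in which HEAD~N names the commit N steps before HEAD (following first parents). It assumes Git is installed; the file name notes.txt and the author details are placeholders for this example.

```shell
workdir=$(mktemp -d)
cd "$workdir"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# Make three commits so there is some history to walk through.
for n in 1 2 3; do
    echo "version $n" > notes.txt
    git add notes.txt
    git commit -q -m "version $n"
done

# HEAD~1 (the same commit as HEAD^) is one step back; HEAD~2 is two steps back.
one_back=$(git log -1 --format=%s HEAD~1)
two_back=$(git log -1 --format=%s HEAD~2)
echo "HEAD~1 is: $one_back"
echo "HEAD~2 is: $two_back"

# Roll the files and the branch back by two commits.
git reset -q --hard HEAD~2
contents=$(cat notes.txt)
echo "notes.txt now contains: $contents"
```

Because a hard reset discards uncommitted work and moves the branch, it is worth checking with git log which commit a reference points to before running it on a real project.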
When Things go Wrong!
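To restore your files to the state of an earlier commit while keeping the later commits in your history, do the following. The first command resets everything to the earlier commit; the second moves the HEAD pointer back to where it was before the reset and leaves the restored files staged, ready to be committed as a new version.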
$ git reset --hard <commit>
$ git reset --soft HEAD@{1}
Git Branching
Branching provides a simple way to maintain multiple, side-by-side versions of the files in a repository. Conceptually, branching a repository creates a copy of the codebase in its current state that you can work on without affecting the primary version from which it was copied. This allows you to work down multiple paths without affecting the main (or other) codebase.
To see a list of branches in your repository
$ git branch
To create a new branch
$ git checkout -b hotfix
New branches are created off of the current working branch. To change branches use
$ git checkout <branch name>
Merging Branches
When you merge a branch, Git folds any changes that you made to files in an identified branch into the current working branch. It also adds any new files. When you perform a merge, a new commit will be automatically created to track the merge, unless the current branch has no new commits of its own, in which case Git simply fast-forwards it to the tip of the merged branch. To merge branches, commit any changes to the branch you want to merge (in this example, the ‘hotfix’ branch), then check out the branch into which you want to merge (for example, master), and then execute a merge command.
$ git commit -m 'committing staged files in hotfix branch'
$ git checkout master
$ git merge hotfix
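The whole branch-and-merge cycle can be tried in a scratch repository. The sketch below assumes Git is installed; report.txt and the author details are placeholder values, and because the name of the starting branch (master or main) depends on how Git is configured, the example looks it up rather than assuming it.

```shell
workdir=$(mktemp -d)
cd "$workdir"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "stable text" > report.txt
git add report.txt
git commit -q -m "Initial version of report"

# Remember the name of the starting branch (master or main, depending on setup).
base=$(git symbolic-ref --short HEAD)

git checkout -q -b hotfix            # create the hotfix branch and switch to it
echo "urgent correction" >> report.txt
git add report.txt
git commit -q -m "Apply urgent correction in hotfix branch"

git checkout -q "$base"              # return to the starting branch
git merge -q hotfix                  # fold the hotfix changes into it

last_line=$(tail -n 1 report.txt)
echo "last line after merge: $last_line"
```

In this particular sketch the starting branch received no commits of its own while hotfix was being edited, so Git completes the merge by fast-forwarding and does not record a separate merge commit.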
Branching Workflows
Introduction to R
Learning objectives
After this lecture, you should be able to:
- define reproducible research and the role of programming languages
- explain what R and RStudio are, how they relate to each other, and identify the purpose of the different RStudio panes
- create and save a script file for later use; use comments to annotate
- solve simple mathematical operations in R
- create variables and dataframes
- inspect the contents of vectors in R and manipulate their content
- subset and extract values from vectors
- use the help function
Before We Start
What are R and RStudio? “R” is both a free and open source programming language designed for statistical computing and graphics, and the software for interpreting the code written in the R language. RStudio is an integrated development environment (IDE) within which you can write and execute code, and interact with the R software. It’s an interface for working with the R software that allows you to see your code, plots, variables, etc. all on one screen. This functionality can help you work with R, connect it with other tools, and manage your workspace and projects. You cannot run RStudio without having R installed. While RStudio is a commercial product, the free version is sufficient for most researchers.
Why learn R? There are many advantages to working with R.
- Scientific integrity. Working with a scripting language like R facilitates reproducible research. Having the commands for an analysis captured in code promotes transparency and reproducibility. Someone using your code and data should be able to exactly reproduce your analyses. An increasing number of research journals not only encourage, but are beginning to require, submission of code along with a manuscript.
- Many data types and sizes. R was designed for statistical computing and thus incorporates many data structures and types to facilitate analyses. It can also connect to local and cloud databases.
- Graphics. R has built-in plotting functionalities that allow you to adjust any aspect of your graph to effectively tell the story of your data.
- Open and cross-platform. Because R is free, open-source software that works across many different operating systems, anyone can inspect the source code, and report and fix bugs. It is supported by a large community of users and developers.
- Interdisciplinary and extensible. Because anyone can write and share R packages, it provides a framework for integrating approaches across domains, encouraging innovation.
Navigating the interface
- Source is your script. You can save this as a .R file and re-run to reproduce your results.
- Console - this is where you run the code. You can type directly here, but it won’t save anything entered here when you exit RStudio.
- Environment/history lists all the objects you have created and the commands you have run.
- Files/plots/packages/help/viewer pane is useful for locating files on your machine to read into R, inspecting any graphics you create, seeing a list of available packages, and getting help.
To interact with R, compose your code in the script and use the execute (or run) command to send it to the console. (Shortcuts: You can use the shortcut Ctrl + Enter, or Cmd + Return, to run a line of code.)
Create a script file for today’s lecture and save it to your lecture_4 folder under ist008_2021 in your home directory. (It’s good practice to keep your projects organized. Some suggested sub-folders for a research project might be: data, documents, scripts, and, depending on your needs, other relevant outputs or products such as figures.)
Mathematical Operations
R works by the process of “REPL”: Read-Eval-Print Loop:
- R waits for you to type an expression (a single piece of code) and press Enter.
- R then reads in your commands and parses them. It checks whether the command is syntactically correct. If so, it will then
- evaluate the code to compute a result.
- R then prints the result in the console and
- loops back around to wait for your next command.
You can use R like a calculator to see how it processes commands. Arithmetic in R follows an order of operations (aka PEMDAS): parentheses, exponents, multiplication and division, addition and subtraction.
7 + 2
7 - 2
244/12
2 * 12
To see the complete order of operations, use the help command:
?Syntax
HELP!
This is just the beginning, and there are lots of resources to help you learn more. R has built-in help files that can be accessed with the ? and args() commands. You can search within the help documentation using the ?? command. (Note: to get help with arithmetic operators you must put the symbol in single or double quotes.) You can view the package documentation using packageDescription("Name"). And, you can always ask the community: Google, Stack Overflow [r], topic-specific mailing lists, and the R-help mailing list. On CRAN, check out the Intro to R Manual and R FAQ. When asking for help, clearly state the problem and provide a reproducible example. R also has a posting guide to help you write questions that are more likely to get a helpful reply. It’s also a good idea to save the output of sessionInfo() so you can show others how your machine and session were configured.
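As a rough illustration of the help tools mentioned above, the sketch below keeps the interactive help commands as comments (they open pages in RStudio's Help pane) and runs only args() and sessionInfo(). The variable name info is just an illustrative choice.

```r
# Interactive help commands, commented out so this block also runs as a script:
# ?mean                     # help page for a function
# ?"+"                      # operators must be quoted
# ??"standard deviation"    # search all installed help pages

# args() shows the arguments that a function accepts
args(sd)

# sessionInfo() records the R version, operating system, and loaded packages
info <- sessionInfo()
print(info$R.version$version.string)
```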
Calls
R has many functions (reusable commands) built-in that allow you to compute mathematical operations, statistics, and other computing tasks. Code that uses a function is said to call that function. When you call a function, the values that you assign as input are called arguments. Some functions have multiple parameters and can accept multiple arguments.
log(10)
sqrt(9)
sum(5, 4, 1)
Variables
A variable is a name for a stored value. Variables allow you to reuse the result of a computation, write general expressions (such as ax + b), and break up your code into smaller steps so it’s easier to test and understand. Variable names can contain letters, numbers, periods, and underscores, but they cannot begin with a number. In general, variable names should be descriptive but concise, and should not reuse the names of common (base R) functions and constants, like mean, T, median, sum, etc.
x <- 10
y <- 24
fantastic.variable2 = x
x<-y/2
In R, variables are copy-on-write. When we change a variable (a “write”), R automatically copies the original value so dependent variables are unchanged until they are re-run.
x = 13
y = x
x = 16
y
Data Types and Classes
R categorizes data into different types that specify how the object is stored in memory. The typeof() command will return the data type of an object. These types map to how we categorize data in statistics:
- continuous (real numbers)
- discrete (integers, or finite number of values)
- logical (1 or 0, T or F)
- nominal (unordered categorical values)
- ordinal (ordered categorical values)
- graph (network data)
- character (text data)
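As a quick check of how R stores a few of these kinds of values, here is a minimal sketch using typeof(). The vector name types is only an illustrative choice.

```r
# Collect the storage type of several values
types <- c(
  typeof(3.7),     # a real number is stored as "double"
  typeof(7L),      # the L suffix requests an "integer"
  typeof(TRUE),    # "logical"
  typeof("text")   # "character"
)
print(types)
```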
Perhaps more useful for day-to-day programming is an object’s class, which specifies how it behaves. Classes in R are hierarchical:
- logical (TRUE, FALSE)
- integer (2, 4, 7)
- numeric (double, 2, 3, 5.7)
- complex (3i)
- character (“marie curie”, “grace hopper”)
x<- 2
class(x)
y<- "two"
class(y)
class(TRUE)
class(mean)
Vectors
A vector is an ordered collection of values. The elements in the vector must have the same data type. (While class and type are independent, the elements of a vector typically share both the same type and the same class.) You can combine or concatenate values to create a vector using c().
v<-c(16, 3, 4, 2, 3, 1, 4, 2, 0, 7, 7, 8, 8, 2, 25)
class(v)
place <- c("Mandro", "Cruess", "ARC", "CoHo", "PES", "Walker", "ARC",
"Tennis Courts", "Library", "Arboretum", "Arboretum", "Disneyland",
"West Village", "iTea", "MU")
class(place)
What happens if you make a typo or try to combine different data types in the same vector? R resolves this for you and automatically converts elements within the vector to be the same data type. It does so through implicit coercion, where it preserves the most information possible (logical -> integer -> numeric -> complex -> character). Sometimes this is very helpful, and sometimes it isn’t.
Basic statistics on vectors
You can use functions built into R to inspect a vector and calculate basic statistics.
length(v) # returns how many elements are within the object
length(place)
min(v) # minimum value
max(v) # maximum value
mean(v)
median(v)
sd(v) # standard deviation
Matrices, Arrays & Lists
Matrices are two-dimensional containers for values. All elements within a matrix must have the same data type. Arrays generalize vectors and matrices to higher dimensions. In contrast, lists are containers for elements with different data types.
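A minimal sketch of the three containers just described, using base R's matrix(), array(), and list(). The object names and the values in them are made up for illustration.

```r
# A matrix: 2 rows by 3 columns, filled column by column
m <- matrix(1:6, nrow = 2, ncol = 3)
dim(m)
m[2, 3]          # row 2, column 3

# An array: the same idea with a third dimension
a <- array(1:24, dim = c(2, 3, 4))
dim(a)

# A list: elements may have different types
student <- list(name = "Sue", age = 20, enrolled = TRUE)
student$name
class(student$age)
```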
Data Frames
We frequently work with 2-dimensional tables of data. For a tabular data set, typically each row corresponds to a single subject and is called an observation. Each column corresponds to the data measures or responses – a feature or covariate. (Sometimes people will also refer to these as variables, but that can be confusing as “variable” means something else in R, so here we’ll try to avoid that term.) R’s structure for tabular data is the data frame.
A data frame is a list of column vectors. Thus, elements of a column must all have the same type (like a vector), but elements of a row can have different types (like a list). Additionally, every column must be the same length. To make a data frame in R, you can combine vectors using the data.frame() command.
distance.mi <- c(3.1, 0.6, 0.8, 0.2, 0.5, 0.2, 0.7, 0.5, 0, 1.2, 1.2, 501, 1.6,
0.4, 4.7)
time.min <- v
major <- c("nutrition", "psychology", "global disease", "political science",
"sociology", "sustainable agriculture", "economics", "political science",
"undeclared", "psychology", "undeclared","economics","political science",
"english", "economics")
my.data <- data.frame(place, distance.mi, time.min, major)
Inspecting Data Frames
You can print a small dataset, but it can be slow and hard to read, especially if there are a lot of columns. R has many other functions to inspect objects:
head(my.data)
tail(my.data)
nrow(my.data)
ncol(my.data)
ls(my.data)
rownames(my.data)
str(my.data)
summary(my.data)
Subsetting
Sometimes you will want to work with only specific elements in a vector or data frame. To do that, you can refer to the position of the element, which is also called the index.
length(time.min)
time.min[15]
You can also subset by using the name of an element in a list. The $ operator extracts a named element from a list, and is useful for extracting the columns from data frames.
How can we use subsetting to look only at the distance response?
my.data$distance.mi
my.data[,2]
distances2<-my.data[["distance.mi"]]
distances3<-my.data[[2]]
What are the responses for political science majors?
polisci_majors <- my.data[which(my.data$major == 'political science'), ]
View(polisci_majors)
which(my.data$major == "political science")
shortframe<-my.data[c(4,8,13),]
What are the majors of the first 5 students who replied?
shortframe2 <- my.data[1:5,"major"] # range for rows, name for column
You can also use $ to create an element within the data frame.
my.data$mpm <- my.data$distance.mi / my.data$time.min
Factors are the class that R uses to represent categorical data. Levels are the categories of a factor. Since R 4.0, data.frame() keeps text columns as character vectors by default, so convert the column with factor() before asking for its levels.
levels(factor(my.data$major))
Control Structures
Control Structures are functions in computer programming that evaluate conditions (like, for example, the value of a variable) and change the way code behaves based upon evaluated values. For example, you might want to perform one function if the value stored in the variable x is greater than 5 and a different function if it is less than 5. The Wikiversity Control Structures page contains a good, general description of control structures that is not programming language specific. The information that follows provides examples of the most frequently used R control structures and how to implement them. For more complete documentation on control structures in R, run the following help command:
?Control
If Statement
The “If Statement” is the most basic of the R control structures. It tests whether a particular condition is true. For example, the below statement tests whether the value of the variable x is greater than 5. If it is, the code prints the phrase “Yay!” to screen. If it is not, the code does nothing:
x <- 7
if (x > 5) {
print("Yay!")
}
Note, the general syntax in the example is:
control_statement (condition) {
# code to execute if condition is true
}
While you will occasionally see variations in how control structures are presented, this is a fairly universal syntax across computer programming languages. The specific control structure being invoked is followed by the condition to be tested. Any actions to be performed if the condition evaluates to TRUE are placed between curly brackets {} following the condition.
Relational Operators
The most common conditions evaluate whether one value is equal to (x == y), greater than or equal to (x >= y), less than or equal to (x <= y), greater than (x > y), or less than (x < y) another value.
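These comparisons can be tried directly at the console. Below is a small sketch with made-up values for x and y; note that the "greater than or equal to" operator is written >=, with the equals sign second.

```r
x <- 7
y <- 5

x == y   # equal to
x >= y   # greater than or equal to
x <= y   # less than or equal to
x > y    # greater than
x < y    # less than

# Comparisons are vectorized: each element is compared in turn
above_y <- c(1, 5, 9) > y
print(above_y)
```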
Another common task is to test whether a BOOLEAN value is TRUE or FALSE. The syntax for this evaluation is:
if (x) {
  # do something
}
Control structures in R also have a negation symbol which allows you to specify a negative condition. For example, the conditional statement in the following code evaluates to TRUE (meaning any code placed between the curly brackets will be executed) if x IS NOT EQUAL to 5:
if (x != 5) {
  # do something
}
If Else Statement
The “If Else” statement is similar to the “If Statement,” but it allows you to specify one code path to execute if the conditional evaluates to TRUE and another to execute if the conditional evaluates to FALSE:
x <- 7
if (x > 5) {
print("Yay!")
} else {
print("Boo!")
}
ifelse Statement
R also offers a combined if/else syntax for quick execution of small code chunks:
x <- 12
ifelse(x <= 10, "x is at most 10", "x is greater than 10")
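One thing the single-value example does not show is that ifelse() works on whole vectors, testing each element in turn. A minimal sketch, with made-up values and illustrative variable names:

```r
scores <- c(4, 10, 12, 25)

# Each element is tested separately, so the result has one label per score
size_labels <- ifelse(scores <= 10, "at most 10", "greater than 10")
print(size_labels)
```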
The switch Statement
The switch statement provides a mechanism for selecting between multiple possible conditions. For example, the following code returns one of several possible values from a list based upon the value of a variable:
x <- 3
switch(x,"red","green","blue")
Note: if you pass switch a numeric value that exceeds the number of elements in the list, R returns NULL invisibly, so nothing is printed.
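The sketch below illustrates the out-of-range case, and also the character form of switch, where a final unnamed value acts as a default. The helper functions color_for and describe are hypothetical names used only for this example.

```r
# Numeric form: pick a value by position
color_for <- function(x) switch(x, "red", "green", "blue")
color_for(3)
# An out-of-range position yields NULL
is.null(color_for(4))

# Character form: pick a value by name, with a default at the end
describe <- function(fruit) {
  switch(fruit,
    apple = "red",
    banana = "yellow",
    "unknown"   # unnamed final value is used when nothing matches
  )
}
describe("banana")
describe("kiwi")
```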
The which Statement
The which statement is not a true conditional statement, but it provides a very useful way to test the values of a dataset and tell you which elements match a particular condition. In the example below, we load the R iris dataset and find out which rows have a Petal.Length greater than 1.4:
data("iris")
rows <- which(iris$Petal.Length > 1.4)
Note: you can see all of R’s built-in datasets with the data() command.
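Because which returns positions, its result can be used directly for subsetting. A small sketch on a made-up vector (times and long_trips are illustrative names):

```r
times <- c(16, 3, 4, 2, 25)

# Positions of the elements greater than 10
long_trips <- which(times > 10)
print(long_trips)

# Use those positions to pull out the matching values
print(times[long_trips])
```

The same pattern works for data frames: iris[rows, ] would return the iris rows selected in the example above.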
Iterating (Loops)
In computer programming, iteration is a specific type of control structure that repeatedly runs a specified operation either for a set number of iterations or until some condition is met. For example, you might want your code to perform the same math operation on all of the numbers stored in a vector of values; or perhaps you want the computer to look through a list until it finds the first entry with a value greater than 10; or, maybe you just want the computer to sound an alarm exactly 5 times. Each of these is a type of iteration or “Loop” as they are also commonly called.
For i in x Loops
The most common type of loop is the “For i in x” loop, which iterates through each value (i) in a list (x) and does something with each value. For example, assume that x is a vector containing the following four names: Sue, John, Heather, George, and that we want to print each of these names to screen. We can do so with the following code:
x <- c("Sue", "John", "Heather", "George")
for (i in x) {
print(i)
}
In the first line of code, we create our vector of names (x). Next we begin our “For i in x” loop, which has the following general syntax, similar to that of the conditional statements you’ve already mastered:
for (variable in sequence) {}
Beginning with the first element of the vector x, which in our case is “Sue”, for each iteration of the for loop the value of the corresponding element in x is assigned to the variable i, and i can then be acted upon in the code included between the curly brackets. In our case we simply tell the computer to print the value of i to the screen. With each iteration, the next value in our vector is assigned to i and is subsequently printed to screen, resulting in the following output:
[1] "Sue"
[1] "John"
[1] "Heather"
[1] "George"
In addition to acting on vectors or lists, For loops can also be coded to simply execute a chunk of code a designated number of times. For example, the following code will print “Hello World!” to screen exactly 10 times:
for (i in 1:10) {
print("Hello World!")
}
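A for loop can also use its counter to index into a vector and build up a result across iterations. The sketch below keeps a running total; the values are made up, and seq_along() is one common way (not the only one) to generate the positions 1 to length(values).

```r
values <- c(3, 8, 1, 9)
total <- 0

# i takes the positions 1, 2, 3, 4 in turn
for (i in seq_along(values)) {
  total <- total + values[i]
  print(paste("Position", i, "running total:", total))
}

print(total)
```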
While Loops
Unlike For loops, which iterate a defined number of times based on the length of a list or range of values provided in the loop declaration, While loops continue to iterate as long as (while) a defined condition is met. For example, assume you have a boolean variable x, the value of which is TRUE. You might want to write code that performs some function repeatedly until the value of x is switched to FALSE. A good example of this is a case where your program asks the user to enter data, which can then be evaluated for correctness before you allow the program to move on in its execution. In the example below, we ask the user to tell us the secret of the universe. If the user answers with the correct answer (42), the code moves on. But if the user provides an incorrect answer, the code iterates back to the beginning of the loop and asks for input again.
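response <- 0
while (is.na(response) || response != 42) {
  # as.integer() returns NA (with a warning) for non-numeric input, so test for NA first
  response <- as.integer(readline(prompt="What is the answer to the Ultimate Question of Life, the Universe, and Everything? "))
}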
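Because readline() needs someone at the keyboard, here is a second, non-interactive sketch of a While loop that can be run as a plain script. It doubles a number until it reaches at least 100 and counts the steps; the variable names are illustrative.

```r
value <- 1
count <- 0

# Keep doubling while the value is still below 100
while (value < 100) {
  value <- value * 2
  count <- count + 1
}

print(value)
print(count)
```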
Repeat Loops
Like While loops, Repeat loops continue to iterate until a specified condition is met; but with Repeat loops that condition is not defined as an argument to the function but as a specific call to “break” that appears in the loop’s executable code. In the example below we assign the value 1 to a variable i and then loop through code that prints and then increments the value of i until it exceeds 10, at which time we forcibly exit the loop:
i <- 1
repeat {
print(i)
i = i+1
if (i > 10){
break
}
}
Break and Next
In the previous section we saw the use of the break statement to force an exit from a repeat loop based on a conditional evaluation in an if statement. Break can actually be used inside any loop (for, while, repeat) in order to force the end of iteration. This can be useful in a variety of contexts where you want to test for multiple conditions as a means of stopping iteration.
The next command is similar to break in that it can be used inside any iteration structure, but instead of ending the loop it forces R to skip the rest of the current iteration. For example, we use next below to iterate through the numbers 1 to 10 and print all values to screen EXCEPT the value 5:
for (i in 1:10) {
if (i == 5){
next
}
print(i)
}
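For comparison, here is a minimal sketch of break inside a for loop. It looks through a vector until it finds the first value greater than 10 and then stops. The values and names are made up for illustration.

```r
values <- c(3, 8, 12, 9, 15)
first_big <- NA

for (v in values) {
  if (v > 10) {
    first_big <- v   # remember the match
    break            # stop looking; 9 and 15 are never examined
  }
}

print(first_big)
```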
Iterating Data.Frame Rows in R
In the section on for loops above, we learned that you can easily iterate across all values of a list using a “for i in x” loop. Working with R data.frames adds a bit of complexity to this process. Because R was developed as a language for statistical analysis, which typically involves the comparison of multiple observations of the same variable (for example, all of the weights recorded across all patients), the default behavior of the “for i in x” loop when applied to data.frames is to iterate across columns (variables) rather than rows (observations). Consider the following example:
for (i in iris) {
print(i)
}
If you run the above code, in the first iteration R will assign the vector of values contained in the first column (Sepal.Length) to i, in the second iteration it will assign the vector of values contained in the second column (Sepal.Width) to i, etc.
Iterating through the data columns of a data.frame is useful for many (if not most) operations. However, there are times when we want to iterate through data one observation at a time. To accomplish this, we need to specifically direct R to move through the data.frame by row, as follows:
for (i in 1:nrow(iris)) {
thisrow <- iris[i,]
print(thisrow)
}
lapply()
R has a built-in class of functions known as the apply family that provide a shorthand for iterating through collections of data. These behave like a for loop, but require much less actual code to accomplish. The lapply function iterates across lists and vectors. When you invoke lapply it applies a defined operation to each item in the submitted list and returns a list of equal length that contains the results of this calculation. In the code below, we assign the values 1 through 10 to a vector and then use lapply to subtract 1 from each item in the vector and finally print the results to screen:
v <- c(1:10)
results <- lapply(v, function(x) (x-1))
print(results)
We could compute the same values with the following for loop, which prints each result rather than collecting the results in a list:
v <- c(1:10)
for (i in v) {
x <- i - 1
print(x)
}
The basic syntax of lapply is:
lapply(list, function)
where “list” is some list object supplied and “function” is a pre-defined chunk of code that will be executed. You’ll learn more about functions in a future lesson.
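Since lapply always returns a list, it is often convenient to flatten the result back into a vector. The sketch below shows two options from base R: unlist() applied to the lapply result, and sapply(), a sibling of lapply that simplifies its result where it can. The variable names are illustrative.

```r
v <- 1:10

# lapply returns a list with one element per input value
as_list <- lapply(v, function(x) x - 1)
class(as_list)

# unlist() flattens that list into a plain vector
as_vector <- unlist(as_list)
print(as_vector)

# sapply() does both steps at once when the results are simple values
simplified <- sapply(v, function(x) x - 1)
print(simplified)
```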
Packages and Functions
Learning objectives
After this lecture, you should be able to:
- explain what a function is
- read and understand the basic syntax of a function in R
- use this syntax to call a function
- use this syntax to build your own function
- test your function
- install packages in R
- load libraries in R
What is a function?
Why build code several or a hundred times when you can build it once and then call and run it as many times as you want? The answer is, don’t! A function allows you to perform an action multiple times in R by calling it and applying it in similar contexts.
For instance, if you build a function that checks the class of all vectors in a dataframe, you can name this function and then apply it to do the same operation with any other dataframe. Or, if you build a function that graphs the correlation between two numeric vectors and exports this graph to a .png file, you can call this same function and apply it to two other vectors, again and again as needed. Functions can greatly increase the efficiency of your programming, and allow you to create flexible and customized solutions.
What is the basic syntax of a function in R?
The basic syntax of a function in R, or the way it should be written so that R recognizes it and applies it to perform actions, is usually stated as follows:
function_name <- function(argument_1, argument_2, ...) {
Function body
}
What this does not demonstrate is that there are actually two steps to a function: building it, and applying it. We will look at both steps in the following code from DataCamp:
Step 1: Building a function
myFirstFun<-function(n)
{
# Compute the square of integer `n`
n*n
}
The code chunk builds the function, setting “myFirstFun” as the name, or variable, to which we have assigned the function. The function itself runs from the word “function” down through the closing curly brace.
What is an argument? In the above example, “(n)” is the argument. R looks for this argument (in this case, “n”) in the body of the function, which in this case is n*n.
When we run the above script, the function is saved as an object into the global environment so that it can be called elsewhere, as demonstrated in the code chunks below.
The function has no effect unless you apply it. Until that happens, the function will do nothing but wait to be called.
Step 2: Calling the function
The code chunk below calls “myFirstFun(n)” and tells R to assign the results of the operation the function performs (n*n) to the variable “u”. But if we run this code as it is (with “n” in the parentheses), we will get an error (unless we have previously assigned “n” as a variable with a value that will accept the operation to be performed — so “n” needs to be a number in this case so that it can be multiplied). We do not actually want to perform the function on the letter “n” but rather, on a number that we will insert in the place of “n.”
We can apply this function by setting “n” as a number, such as 2, in the example below.
# Call the function with argument `n`
u <- myFirstFun(2)
# Call `u`
u
Once we have changed “n” to a number, R then performs this operation and saves the result to a new variable “u”. We can then ask R to tell us what “u” is, and R returns or prints the results of the function, which in this case, is the number 4 (2*2).
The image below shows the results we get if we attempt to run the function without changing the argument “n” to a number (giving us an error), and the results when we change “n” to the number “2” which assigns the result of the function (4) to “u”, or the number “3” which assigns the result of the function (now 9) to “u”.
It is important to understand that “n” is an argument of the function “myFirstFun.” R does not consider “n” a variable, but it acts like a variable because it can change as you call the function into different contexts. To R, “u” and “myFirstFun” are variables because they are names to which values and other content are assigned.
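One simple way to test a function like this is to call it with inputs whose answers you already know and let stopifnot() complain if any result is wrong. This is only a sketch of one possible approach; the chosen inputs and the use of tryCatch() to observe the error case are illustrative.

```r
# Re-create the function so this block runs on its own
myFirstFun <- function(n) {
  n * n
}

# stopifnot() is silent when every check passes and raises an error otherwise
stopifnot(myFirstFun(2) == 4)
stopifnot(myFirstFun(-3) == 9)
stopifnot(myFirstFun(0) == 0)

# Text cannot be multiplied, so this call fails; tryCatch() lets us see that safely
result <- tryCatch(myFirstFun("a"), error = function(e) "error")
print(result)
```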
Here is another example of a function with one argument:
Step 1: Build the function In the code below, we will build a function that checks the classes of all vectors in a dataframe.
#build function with one argument (variable)
check_class <- function(data) {
lapply(data, class)
}
Step 2: Call the function in one or more contexts. In the code below, we will call the function we built above and apply it to two different datasets. Just as we saw in the example above where we inserted the numbers 2 or 3 in place of “n”, we will insert the name of the datasets we want to use in place of the word “data” to call the new function we have built.
Note: you will need to load the built-in R datasets “mtcars” and “iris” in order to test the code below.
#run check_class function on two different dataframes
check_class(mtcars)
check_class(iris)
A function can have more than one argument
A function works similarly when it has two or more arguments.
Let’s say we only want to look at the first vector or column in the dataframe “mtcars.” We would write a line of code that looks like this:
#pull the values of the first column /vector in the dataframe "mtcars"
mtcars[1]
But if we wanted to create a function that looks at any column/vector in any dataframe, we could write a function that looks like this:
#build function with two arguments (variable)
one_column <- function(data, x) {
data[x]
}
Note: if we want to tell a user what kind of input we want to include, we could instead do something like function(dataset, column_position) or function(dataset, column_name).
Once we have run the above function (telling R to save it to the global environment), we would then call this new function, which we have named one_column, and apply it to various dataframes, telling R which column or vector in each dataframe we want to view.
#run one_column function on two different dataframes
one_column(mtcars, 1)
one_column(iris, 2)
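As the note above suggests, the second argument does not have to be a position. The sketch below renames the arguments to dataset and column (hypothetical names) and shows that a column name works as well as a column number.

```r
# Same function as above, with more descriptive argument names
one_column <- function(dataset, column) {
  dataset[column]
}

# mpg is the first column of mtcars, so these two calls select the same data
by_position <- one_column(mtcars, 1)
by_name <- one_column(mtcars, "mpg")

identical(by_position, by_name)
```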
Packages
A package is a set of functions that other users and developers have made that allow R users to perform various operations. As with many applications and software, some R packages are well crafted, documented, and updated frequently, while others are not. You will want to use your best judgment and choose packages that you think will help you in your work, but will remain stable and functional. Try adding the packages below:
dplyr
wakefield
rlang
Go to Tools > Install Packages in RStudio, search for the packages, and then follow the steps to install them.
Once you have installed them, you will then need to load the libraries into your R environment by using the following code:
#load libraries
library(dplyr)
library(wakefield)
library(rlang)
If you have installed the above packages and loaded their libraries, you can then create a function that uses the table you made in the earlier session, “Introduction to R,” to add five rows of data, add a logical vector with randomly assigned logical values, and save this as a new table. Your function might look something like this code below. The comment tags indicate what each line of the function will do.
Note: you will need to load in your data for my.data with the initial 15 rows before proceeding with the next steps.
#Step 1: Build a function that adds a logical vector with randomly assigned TRUE/FALSE values
make_logical.vec <- function(dataset, new.col) {
  #make a logical vector with one random value per row of the dataset
  vector_1 <- r_sample_logical(nrow(dataset), prob = NULL, name = "new.vector") %>% as.logical()
  #tell R to read input for the name of new.col so that we can assign this name to the vector/column
  colName = quo_name(new.col)
  #add our new vector, with the name we have specified, to the dataset
  dataset %>% mutate(!!quo_name(colName) := vector_1) #create the new column
}
#Step 2: Call our new function ‘make_logical.vec’ and assign the results to the table ‘my.data’.
my.data <- make_logical.vec(my.data, "logical.vec")
Note: If we call the function, setting the dataset to ‘my.data’ and the name of the new vector to ‘logical.vec’, it will create the dataframe but will only print it for us in our console. If we want to actually save the new dataframe to update our existing dataframe, we need to reassign it to ‘my.data’, so that the updated dataframe replaces the original dataframe.
As we can see in the code above, a function can contain more than one variable, and can include several or many lines of code and perform many operations. The above example demonstrates this, and also shows that packages such as the ones we have loaded here, while optional for working in R, can allow you to call many useful functions.
Using a package and function to graph data and export a .png
We can install and load the ‘ggplot2’ or ‘ggforce’ package to graph data from a dataframe and export the graph to a file. For example, below we can build a function that graphs the data from two columns/vectors, and then generates a .png file.
We start by loading the package that contains the plotting functions we want to use:
Note: if you have not installed ggforce already, you will want to do that now.
#graph distance and time from my.data
library(ggforce)
Next, we could build a function that looks like this:
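# write a function that graphs two columns from a dataset
# column1 and column2 are column names given as character strings
graph_data <- function(data, column1, column2, n) {
  # keep rows where column1 is at most n, then plot column1 against column2
  graph <- data %>%
    filter(.data[[column1]] <= n) %>%
    ggplot(aes(.data[[column2]], .data[[column1]])) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE)
  # write the plot to graph.png in the working directory
  png("graph.png")
  print(graph)
  dev.off()
}
Lastly, we can call the above function, and apply it to the dataframe, my.data, to compare distance and time of travel. Because the function looks columns up by name, we pass the column names as strings.
graph_data(my.data, "distance.mi", "time.min", 14.0)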
The above function generates a .png that looks like this:
Saving functions and calling them from another file
You can save the functions you build to a separate file, and then load these as a source. For example, I might save my functions to an R script, called “functions.r”. I can then load these sources along with my packages into my R environment.
Note: Although we loaded libraries as we went through this lesson, the best practice is to load your packages and source files at the very beginning of your new R script, as shown in the example that follows.
library(dplyr)
library(wakefield)
library(rlang)
library(ggforce)
source("functions.r")
The above code will allow you to call functions that are saved in these libraries and in the functions.r file.
File Input and Output
This lesson will cover some standard functions for reading and writing data in R.
Objectives
- getting and setting working directory
- save and load R objects to/from disk
- read and write tabular data
- read data from a url
Basic Idea
As a data scientist, you will constantly be reading from and writing to files. Generally you are given some dataset that you need to analyze and report on. This means that you need to load the data into R, run some code, and finally save some outputs. This is all done with files.
File Formats
When people talk about binary files versus text files, what they really mean is: is it human readable? A text file should have text data that a human can read with a text editor. A binary file has binary data that a human can’t really read, but the appropriate software can.
Filesystems and Paths
At a high level, files are information stored on a computer. Each file has a name and a unique file path. A file path is the location of the file in the storage device. A file name has two parts: the name and the file extension.
The file extension (everything after the .) is meant to indicate to the user (you) and the operating system what the file contains. This hint means you often don’t need to open the file to know what kind of data it has. However, the extension is not enforced by anything; it’s just a useful suggestion.
Paths can be relative or absolute. An absolute path is the full path through the filesystem to reach a file. A relative path is the path from some starting point to the file you want to reach. You can consider a relative path as something that needs to be combined with another path to reach a file.
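To make the distinction concrete, here is a small sketch that builds a relative path and then combines it with a starting point to get an absolute one. The folder and file names (data, survey.csv) are hypothetical and the file does not need to exist; getwd() returns the current working directory.

```r
# A relative path: only meaningful in combination with a starting point
rel_path <- file.path("data", "survey.csv")
print(rel_path)

# Combine it with the working directory to get an absolute path
abs_path <- file.path(getwd(), rel_path)
print(abs_path)

# Split the file name from the rest of the path, and pull out the extension
basename(abs_path)
tools::file_ext(abs_path)
```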
get and set working directory
Before we begin, we will get and set our working directory in R. You can think of the working directory as the part that gets combined with the relative path.
getwd() will return the absolute file path of the working directory
getwd()
Call setwd to set the working directory to a path you specify as an input argument.
setwd("~/Documents/file_io/") # notice the argument
getwd()
A really useful function in R is list.files(), which lists all the files at a given path. Listing the files should confirm for us that we are in the right place.
list.files()
saving and loading R data
rds
A simple way to save an R object directly to a file, such that it can be loaded into another R session is with the saveRDS function. saveRDS will write a single object to a specified file path. By default, it will save the object as a binary representation. This can be very useful for large objects, as the binary format will be significantly more space efficient.
y = c(0,1,2,3,4)
saveRDS(y, file="myvector.rds")
Confirm that it worked.
list.files()
The counterpart to saveRDS is readRDS. With readRDS, you can load in an rds file, which by definition contains a single R object, and assign it to a variable in your session.
x = readRDS("myvector.rds")
This will work for any R object. For example:
saveRDS(mtcars, file="mtcars.rds")
my_cars = readRDS("mtcars.rds")
Saving and loading with saveRDS and readRDS is a powerful way to store data. However, it does have a pretty significant drawback: it’s useless outside of R. For someone to explore the data, they would need to load it in R.
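Within R, the save and load round trip can be checked with identical(). The sketch below writes to a temporary file created with tempfile() so that it does not clutter your working directory; the variable names are illustrative.

```r
# A throwaway file path that ends in .rds
path <- tempfile(fileext = ".rds")

y <- c(0, 1, 2, 3, 4)
saveRDS(y, file = path)

# Read the object back and compare it with the original
restored <- readRDS(path)
identical(y, restored)

# Clean up the temporary file
file.remove(path)
```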
Reading and Writing Text data
In addition to saveRDS and readRDS, R has functions for working with text files.
These are commonly used for getting external data into R and for exporting your data so that it can be used by other people.
tabular data
Generally the data you work with in R will be tabular. Dataframes are an example of tabular data.
read and write table
To write tabular data to a text file, use the write.table function. Before running it, let’s look at the documentation and understand the key arguments.
?write.table
The important arguments are x, file, and sep.
x is the dataframe you are saving. file is the name of the file you want to create and write to. sep is the field separator, also called the delimiter.
Notice that if file is left blank, then R will just print the results to the console, instead of into a file. Let’s use this to explore what the sep argument does.
small = head(my_cars)
write.table(small)
write.table(small, sep=" ")
write.table(small, sep=".")
write.table(small, sep=",")
Let’s write our data to a file.
write.table(my_cars, file="cars.txt")
list.files() # confirm it worked
Now let’s read the data back in.
from_cars.txt = read.table("cars.txt")
Always inspect your data to make sure everything worked.
colnames(from_cars.txt)
dim(from_cars.txt)
head(from_cars.txt)
CSV format
A CSV (comma separated values) file is a text file that uses ‘,’ as the field separator. This is probably the most commonly used format for plain text tabular data.
To write a CSV file in R, use the write.csv function. This is similar to write.table(from_cars.txt, file="cars.csv", sep=","), except that write.csv also writes an empty header entry for the column of row names.
write.csv(from_cars.txt, file="cars.csv")
from_cars.csv = read.csv("cars.csv")
Again, double check that everything worked.
head(from_cars.csv)
colnames(from_cars.csv)
dim(from_cars.csv)
What went wrong here? In this case it was ambiguous whether the first column held row names or actual values, so read.csv read it in as an ordinary data column.
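The problem can be reproduced in isolation. This is a small sketch that uses the built-in mtcars data and a temporary file (both choices are assumptions made for illustration) and compares the shape of the data before and after the round trip.

```r
tmp = tempfile(fileext = ".csv")

write.csv(mtcars, file = tmp)   # row names are written as the first column
back = read.csv(tmp)            # read back with default arguments

dim(mtcars)          # 32 rows, 11 columns
dim(back)            # 32 rows, 12 columns: one extra
colnames(back)[1]    # "X", the name R gives the unnamed first column
head(back[, 1])      # the car names, now stored as ordinary values

unlink(tmp)
```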
Let’s fix it.
temp = from_cars.csv[, 2:12]
rownames(temp) = from_cars.csv[,1]
fixed = temp
You can generally fix these sorts of problems by passing the appropriate arguments to the read and write functions. Notice the row.names argument in the function call below. It specifies that the row names should be read from the first column of the tabular data in the file.
from_cars.csv = read.csv("cars.csv", row.names =1)
Non tabular data
There are functions in R for reading and writing text data that doesn’t represent tabular data. A common pair is writeLines and readLines.
texts = c("line one", "line two")
writeLines(texts, "raw.txt")
texts2 = readLines("raw.txt")
URLs as files
Files can be transferred over the internet. A URL is a kind of file path that identifies both a file and the computer that file is stored on. Many R functions that read from files can be given a URL as the file path argument. In that case, the file is transferred over the internet to your computer and then read into R.
Here is an example of reading in a file from a url, using the readLines function.
url = "https://datalab.ucdavis.edu"
t = readLines(url)
Strings and Regular Expressions
After this lesson, you should be able to:
- Print strings with cat
- Read and write escape sequences and raw strings
- With the stringr package:
  - Split strings on a pattern
  - Replace parts of a string that match a pattern
  - Extract parts of a string that match a pattern
- Read and write regular expressions, including:
  - Anchors ^ and $
  - Character classes []
  - Quantifiers ?, *, and +
  - Groups ()
Printing Output
The cat function prints a string in the R console. If you pass multiple arguments, they will be concatenated:
cat("Hello")
## Hello
cat("Hello", "Nick")
## Hello Nick
Pitfall 1: Printing a string is different from returning a string. The cat function only prints (and always returns NULL). For example:
f = function() {
  cat("Hello")
}
x = f()
## Hello
x
## NULL
If you just want to concatenate some strings (but not necessarily print them), use paste instead of cat. The paste function returns a string. The str_c function in stringr (a package we’ll learn about later in this lesson) can also concatenate strings.
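A short sketch of the difference: paste returns the combined string, so the result can be stored and reused. The variable names are only for illustration.

```r
# paste() joins its arguments with a space by default and returns the result.
greeting = paste("Hello", "Nick")
greeting

# The sep argument changes what goes between the pieces.
paste("Hello", "Nick", sep = ", ")

# paste0() is shorthand for sep = "".
paste0("Hello", "Nick")
```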
Pitfall 2: Remember to print strings with the cat function, not the print function. The print function prints R’s representation of an object, the same as if you had entered the object in the console without calling print.
For instance, print prints quotes around strings, whereas cat does not:
print("Hello")
## [1] "Hello"
cat("Hello")
## Hello
Escape Sequences
In a string, an escape sequence or escape code consists of a backslash followed by one or more characters. Escape sequences make it possible to:
- Write quotes or backslashes within a string
- Write characters that don’t appear on your keyboard (for example, characters in a foreign language)
For example, the escape sequence \n corresponds to the newline character. Notice that the cat function translates \n into a literal new line, whereas the print function doesn’t:
x = "Hello\nNick"
cat(x)
## Hello
## Nick
print(x)
## [1] "Hello\nNick"
As another example, suppose we want to put a literal quote in a string. We can either enclose the string in the other kind of quotes, or escape the quotes in the string:
x = 'She said, "Hi"'
cat(x)
## She said, "Hi"
y = "She said, \"Hi\""
cat(y)
## She said, "Hi"
Since escape sequences begin with backslash, we also need to use an escape sequence to write a literal backslash. The escape sequence for a literal backslash is two backslashes:
x = "\\"
cat(x)
## \
There’s a complete list of escape sequences for R in the ?Quotes help file. Other programming languages also use escape sequences, and many of them are the same as in R.
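It can help to remember that an escape sequence takes several keystrokes to type but stands for a single character in the string. A quick way to check this, sketched here with nchar, is to count the characters:

```r
# Each escape sequence below counts as exactly one character.
n_tab = nchar("a\tb")      # a, tab, b
n_backslash = nchar("\\")  # one literal backslash
n_quote = nchar("\"")      # one literal double quote

n_tab
n_backslash
n_quote
```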
Raw Strings
A raw string is a string where escape sequences are turned off. Raw strings are especially useful for writing regular expressions, which we’ll do later in this lesson.
Raw strings begin with r" and an opening delimiter (, [, or {. Raw strings end with a matching closing delimiter and quote. For example:
x = r"(quotes " and backslashes \)"
cat(x)
## quotes " and backslashes \
Raw strings were added to R in version 4.0 (April 2020), and won’t work correctly in older versions.
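As a small sketch, the two strings below hold the same characters, written once with escape sequences and once as a raw string. The Windows-style path is hypothetical, and the code assumes R 4.0 or newer.

```r
# The same text written two ways (hypothetical path, illustration only).
escaped = "C:\\Users\\nick\\data.csv"
raw_str = r"(C:\Users\nick\data.csv)"

identical(escaped, raw_str)

# Other delimiters are handy when the text itself contains parentheses.
bracketed = r"[f(x) = \n]"
cat(bracketed)
```

Here \n inside the raw string stays as a backslash followed by n, so cat prints it literally rather than starting a new line.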
Character Encodings
Computers store data as numbers. In order to store text on a computer, we have to agree on a character encoding, a system for mapping characters to numbers. For example, in ASCII, one of the most popular encodings in the United States, the character a maps to the number 97.
Many different character encodings exist, and sharing text used to be an inconvenient process of asking or trying to guess the correct encoding. This was so inconvenient that in the 1980s, software engineers around the world united to create the Unicode standard. Unicode includes symbols for nearly all languages in use today, as well as emoji and many ancient languages (such as Egyptian hieroglyphs).
Unicode maps characters to numbers, but unlike a character encoding, it doesn’t dictate how those numbers should be mapped to bytes (sequences of ones and zeroes). As a result, there are several different character encodings that support and are synonymous with Unicode. The most popular of these is UTF-8.
In R, we can write Unicode characters with the escape sequence \U followed by the number for the character in base 16. For instance, the number for a in Unicode is 97 (the same as in ASCII). In base 16, 97 is 61. So we can write an a as:
x = "\U61" # or "\u61"
x
## [1] "a"
Unicode escape sequences are usually only used for characters that are not easy to type. For example, the cat emoji is number 1f408 (in base 16) in Unicode. So the string "\U1f408" is the cat emoji.
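The arithmetic above can be checked from within R. This is a minimal sketch using base R's utf8ToInt, intToUtf8, as.hexmode, and strtoi; the variable names are only for illustration.

```r
# Character to Unicode number.
code = utf8ToInt("a")
code

# The same number written in base 16.
hex = format(as.hexmode(code))
hex

# Base 16 text back to an ordinary number.
strtoi("61", base = 16)

# Unicode number back to a character.
intToUtf8(code)
```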
Character Encodings in Text Files
Most of the time, R will handle character encodings for you automatically. However, if you ever read or write a text file (including CSV and other formats) and the text looks like gibberish, it might be an encoding problem. This is especially true on Windows, the only modern operating system that does not (yet) use UTF-8 as the default encoding.
Encoding problems when reading a file can usually be fixed by passing the encoding to the function doing the reading. For instance, the code to read a UTF-8 encoded CSV file on Windows is:
read.csv("my_data.csv", fileEncoding = "UTF-8")
Other reader functions may use a different parameter to set the encoding, so always check the documentation. On computers where the native language is not set to English, it can also help to set R’s native language to English with Sys.setlocale(locale = "English").
Encoding problems when writing a file are slightly more complicated to fix. See this blog post for a thorough explanation.
The Tidyverse
The Tidyverse is a popular collection of packages for doing data science in R. The packages are made by many of the same people that make RStudio. They provide alternatives to R’s built-in tools for:
- Manipulating strings (package stringr)
- Making visualizations (package ggplot2)
- Reading files (package readr)
- Manipulating data frames (packages dplyr, tidyr, tibble)
- And more
Think of the Tidyverse as a different dialect of R. Sometimes the syntax is different, and sometimes ideas are easier or harder to express concisely. Whether to use base R or the Tidyverse is mostly subjective. As a result, the Tidyverse is somewhat polarizing in the R community. It’s useful to be literate in both, since both are popular.
One advantage of the Tidyverse is that the packages are usually well-documented. For example, there are documentation websites and cheat sheets for most Tidyverse packages.
The stringr Package
The rest of this lesson uses stringr, the Tidyverse package for string processing. R also has built-in functions for string processing. The main advantage of stringr is that all of the functions use a common set of parameters, so they’re easier to learn and remember.
The first time you use stringr, you’ll have to install it with install.packages (the same as any other package). Then you can load the package with the library function:
# install.packages("stringr")
library(stringr)
The typical syntax of a stringr function is:
str_NAME(string, pattern, ...)
Where:
- NAME describes what the function does
- string is the string to search within or transform
- pattern is the pattern to search for
- ... is additional, function-specific arguments
For example, the str_detect function detects whether the pattern appears within the string:
str_detect("hello", "el")
## [1] TRUE
str_detect("hello", "ol")
## [1] FALSE
Most of the stringr functions are vectorized:
str_detect(c("hello", "goodbye", "lo"), "lo")
## [1] TRUE FALSE TRUE
There are a lot of stringr functions. The remainder of this lesson focuses on three that are especially important, as well as some of their variants:
- str_split_fixed
- str_replace
- str_match
You can find a complete list of stringr functions with examples in the documentation or cheat sheet.
Splitting Strings
The str_split function splits the string at each position that matches the pattern. The characters that match are thrown away.
For example, suppose we want to split a sentence into words. Since there’s a space between each word, we can use a space as the pattern:
x = "The students in this class are great!"
result = str_split(x, " ")
result
## [[1]]
## [1] "The" "students" "in" "this" "class" "are" "great!"
The str_split function always returns a list with one element for each input string. Here the list only has one element because x only has one element. We can get the first element with:
result[[1]]
## [1] "The" "students" "in" "this" "class" "are" "great!"
To see why the function returns a list, consider what happens if we try to split two different sentences at once:
x = c(x, "Are you listening?")
result = str_split(x, " ")
result[[1]]
## [1] "The" "students" "in" "this" "class" "are" "great!"
result[[2]]
## [1] "Are" "you" "listening?"
Each sentence has a different number of words, so the vectors in the result have different lengths. So a list is the only way to store both.
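Because the result is a list, the usual list tools apply to it. As a sketch, lengths counts the words in each sentence and sapply pulls out the first word of each. Neither function is specific to stringr.

```r
library(stringr)

x = c("The students in this class are great!", "Are you listening?")
words = str_split(x, " ")

# Number of pieces in each list element.
word_counts = lengths(words)
word_counts

# First word of each sentence.
first_words = sapply(words, function(w) w[1])
first_words
```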
The str_split_fixed function is almost the same as str_split, but takes a third argument for the number of pieces to return. Any text left over after the last split stays together in the final piece. Because the number of pieces is fixed, the function can return the result in a matrix instead of a list. For example:
str_split_fixed(x, " ", 3)
##      [,1]  [,2]       [,3]
## [1,] "The" "students" "in this class are great!"
## [2,] "Are" "you"      "listening?"
The str_split_fixed function is often more convenient than str_split because the nth piece of each input string is just the nth column of the result.
For example, suppose we want to get the area code from some phone numbers:
phones = c("717-555-3421", "629-555-8902", "903-555-6781")
result = str_split_fixed(phones, "-", 3)
result[, 1]
## [1] "717" "629" "903"
Replacing Parts of Strings
The str_replace function replaces the pattern the first time it appears in the string. The replacement goes in the third argument.
For instance, suppose we want to change the word "dog" to "cat":
x = c("dogs are great, dogs are fun", "dogs are fluffy")
str_replace(x, "dog", "cat")
## [1] "cats are great, dogs are fun" "cats are fluffy"
The str_replace_all function replaces the pattern every time it appears in the string:
str_replace_all(x, "dog", "cat")
## [1] "cats are great, cats are fun" "cats are fluffy"
We can also use the str_replace and str_replace_all functions to delete part of a string by setting the replacement to the empty string "".
For example, suppose we want to delete the comma:
str_replace(x, ",", "")
## [1] "dogs are great dogs are fun" "dogs are fluffy"
In general, stringr functions with the _all suffix affect all matches. Functions without _all only affect the first match.
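The difference matters most when deleting. In this sketch, with a made-up phone number, removing only the first space leaves the second one behind:

```r
library(stringr)

phone = "717 555 3421"   # hypothetical number, illustration only

first_only = str_replace(phone, " ", "")      # deletes the first space
every_one  = str_replace_all(phone, " ", "")  # deletes every space

first_only
every_one
```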
We’ll learn about str_match at the end of the next section.
Regular Expressions
The stringr functions (including the ones we just learned) use a special language called regular expressions or regex for the pattern. The regular expressions language is also used in many other programming languages besides R.
A regular expression can describe a complicated pattern in just a few characters, because some characters, called metacharacters, have special meanings. Letters and numbers are never metacharacters. They’re always literal.
Here are a few examples of metacharacters (we’ll look at examples in the subsequent sections):
| Metacharacter | Meaning |
|---|---|
| . | any single character (wildcard) |
| \ | escape character (in both R and regex) |
| ^ | beginning of string |
| $ | end of string |
| [ab] | 'a' or 'b' |
| [^ab] | any character except 'a' or 'b' |
| ? | previous character appears 0 or 1 times |
| * | previous character appears 0 or more times |
| + | previous character appears 1 or more times |
| () | make a group |
More metacharacters are listed on the stringr cheatsheet, or in ?regex.
The Wildcard
The str_view function is especially helpful for testing regular expressions. It opens a browser window with the first match in the string highlighted. We’ll use it in the subsequent regex examples.
The regex wildcard character is . and matches any single character.
For example:
x = "dog"
str_view(x, "d.g")
By default, regex searches from left to right:
str_view(x, ".")
Escape Sequences
Like R, regular expressions can contain escape sequences that begin with a backslash. These are computed separately and after R escape sequences. The main use for escape sequences in regex is to turn a metacharacter into a literal character.
For example, suppose we want to match a literal dot (.). The regex for a literal dot is \.. Since backslashes in R strings have to be escaped, the R string for this regex is "\\.". Then the regex works:
str_view("this.string", "\\.")
The double backslash can be confusing, and it gets worse if we want to match a literal backslash. We have to escape the backslash in the regex (because backslash is the regex escape character) and then also have to escape the backslashes in R (because backslash is also the R escape character). So to match a single literal backslash in R, the code is:
str_view("this\\that", "\\\\")
Raw strings are helpful here, because they make the backslash literal in R strings (but still not in regex). We can use raw strings to write the above as:
str_view(r"(this\that)", r"(\\)")
You can turn off regular expressions entirely in stringr with the fixed function:
str_view(x, fixed("."))
It’s good to turn off regular expressions whenever you don’t need them, both to avoid mistakes and because they take longer to compute.
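To see the three options side by side, here is a minimal sketch using str_detect on two made-up strings, only one of which contains an actual dot:

```r
library(stringr)

x = c("a.b", "axb")

# Unescaped dot: a wildcard, so both strings match.
wildcard = str_detect(x, "a.b")

# Escaped dot: only the string with a literal dot matches.
escaped = str_detect(x, "a\\.b")

# fixed(): regex is turned off, so the dot is literal here too.
literal = str_detect(x, fixed("a.b"))

wildcard
escaped
literal
```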
Anchors
By default, a regex will match anywhere in the string. If you want to force a match at a specific place, use an anchor.
The beginning of string anchor is ^. It marks the beginning of the string, but doesn’t count as a character in the match.
For example, suppose we want to match an a at the beginning of the string:
x = c("abc", "cab")
str_view(x, "a")
str_view(x, "^a")
It doesn’t make sense to put characters before ^, since no characters can come before the beginning of the string.
Likewise, the end of string anchor is $. It marks the end of the string, but doesn’t count as a character in the match.
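A short sketch of the end of string anchor, using str_detect so the results print as TRUE or FALSE. Using both anchors together forces the pattern to cover the whole string.

```r
library(stringr)

x = c("abc", "cab")

ends_with_c   = str_detect(x, "c$")     # c must be the last character
starts_with_c = str_detect(x, "^c")     # c must be the first character
exactly_abc   = str_detect(x, "^abc$")  # the whole string must be abc

ends_with_c
starts_with_c
exactly_abc
```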
Character Classes
In regex, square brackets [ ] create a character class. A character class counts as one character, but that character can be any of the characters inside the square brackets. The square brackets themselves don’t count as characters in the match.
For example, suppose we want to match a c followed by either a or t:
x = c("ca", "ct", "cat", "cta")
str_view(x, "c[ta]")
You can use a dash - in a character class to create a range. For example, to match letters p through z:
str_view(x, "c[p-z]")
Ranges also work with numbers and capital letters. To match a literal dash, place the dash at the end of the character class (instead of between two other characters), as in [abc-].
Most metacharacters are literal when inside a character class. For example, [.] matches a literal dot.
A hat ^ at the beginning of the character class negates the class. So for example, [^abc] matches any one character except for a, b, or c:
str_view("abcdef", "[^abc]")
Quantifiers
Quantifiers are metacharacters that affect how many times the preceding character must appear in a match. The quantifier itself doesn’t count as a character in the match.
For example, the ? quantifier means the preceding character can appear 0 or 1 times. In other words, ? makes the preceding character optional.
For example:
x = c("abc", "ab", "ac", "abbc")
str_view(x, "ab?c")
The * quantifier means the preceding character can appear 0 or more times. In other words, * means the preceding character can appear any number of times or not at all.
str_view(x, "ab*c")
The + quantifier means the preceding character must appear 1 or more times.
Quantifiers are greedy, meaning they always match as many characters as possible.
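The sketch below tries the + quantifier on the same strings and then illustrates greediness. It uses str_extract, a stringr function not otherwise covered in this lesson, which returns the first piece of the string that matches the pattern.

```r
library(stringr)

x = c("abc", "ab", "ac", "abbc")

# + requires at least one b between a and c.
plus_match = str_detect(x, "ab+c")
plus_match

# Greedy matching: each quantifier takes as many b's as it is allowed.
y = "abbbbc"
got_plus     = str_extract(y, "ab+")  # all four b's
got_star     = str_extract(y, "ab*")  # all four b's
got_question = str_extract(y, "ab?")  # at most one b

got_plus
got_star
got_question
```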
Groups
In regex, parentheses create a group. Groups can be affected by quantifiers, making it possible to repeat a pattern (rather than just a character). The parentheses themselves don’t count as characters in the match.
For example:
x = c("cats, dogs, and frogs", "cats and frogs")
str_view(x, "cats(, dogs,)? and frogs")
Extracting Matches
Groups are especially useful with the stringr functions str_match and str_match_all.
The str_match function extracts the overall match to the pattern, as well as the match to each group. So you can use str_match to split a string in more complicated ways than str_split, or to extract specific pieces of a string.
For example, suppose we want to split an email address:
str_match("naulle@ucdavis.edu", "([^@]+)@(.+)[.](.+)")
##      [,1]                 [,2]     [,3]      [,4]
## [1,] "naulle@ucdavis.edu" "naulle" "ucdavis" "edu"
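As a further sketch, the phone numbers from the splitting section can be pulled apart the same way. The pattern here is one possible choice, not the only one: each group captures a run of digits, and the dashes sit outside the groups so they are matched but not captured.

```r
library(stringr)

phones = c("717-555-3421", "629-555-8902", "903-555-6781")

# Column 1 is the overall match; columns 2 to 4 are the three groups.
m = str_match(phones, "([0-9]+)-([0-9]+)-([0-9]+)")

area_codes   = m[, 2]
line_numbers = m[, 4]

area_codes
line_numbers

# A string that doesn't match the pattern gives a row of NA values.
no_match = str_match("no digits here", "([0-9]+)-([0-9]+)-([0-9]+)")
no_match
```

Because str_match is vectorized, the result has one row per input string, so each column lines up with the original vector.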